On the unsupervised analysis of domain-specific Chinese texts.

Authors

  • Ke Deng
  • Peter K Bol
  • Kate J Li
  • Jun S Liu
Abstract

With the growing availability of digitized text data, both public and private, there is a great need for effective computational tools to automatically extract information from texts. Because the Chinese language differs most significantly from alphabet-based languages in not specifying word boundaries, most existing Chinese text-mining methods require a prespecified vocabulary and/or a large relevant training corpus, which may not be available in some applications. We introduce an unsupervised method, top-down word discovery and segmentation (TopWORDS), for simultaneously discovering and segmenting words and phrases from large volumes of unstructured Chinese texts, and propose ways to order discovered words and conduct higher-level context analyses. TopWORDS is particularly useful for mining online and domain-specific texts where the underlying vocabulary is unknown or the texts of interest differ significantly from available training corpora. When outputs from TopWORDS are fed into context analysis tools such as topic modeling, word embedding, and association pattern finding, the results are as good as or better than those obtained from the outputs of a supervised segmentation method.
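To make the segmentation problem concrete, the following is a minimal sketch of maximum-probability segmentation under a unigram word model, solved by dynamic programming. It is not the TopWORDS algorithm: the function name `segment`, the `max_len` parameter, and the toy probabilities are all assumptions made for illustration, and the dictionary is supplied by hand here, whereas an unsupervised method would have to estimate it from the texts.

```python
import math


def segment(text, word_prob, max_len=4):
    """Split `text` into dictionary words, maximizing the product of word probabilities.

    `word_prob` maps candidate words to positive probabilities (hypothetical values).
    `max_len` caps the length of any single word that is considered.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of any split of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in that split
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob.get(text[start:end])
            if p is None or best[start] == -math.inf:
                continue  # unknown word, or the prefix before it cannot be segmented
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this dictionary")

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy dictionary with made-up probabilities, for illustration only.
toy_probs = {"北": 0.05, "京": 0.05, "大": 0.1, "学": 0.1, "北京": 0.3, "大学": 0.3}
print(segment("北京大学", toy_probs))  # ['北京', '大学']
```

The two-word split wins because its probability (0.3 × 0.3) exceeds that of any split into single characters. The sketch also shows why a prespecified vocabulary is limiting: a text containing a character or word missing from `word_prob` cannot be segmented at all.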

Similar resources

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. It is effective in image classification, where obtaining adequately labeled data is expensive and time-consuming. We propose a novel method named DALRRL, which consists of deep ...

A Branching Strategy For Unsupervised Aspect-based Sentiment Analysis

One of the most recent opinion mining research directions is the extraction of polarities referring to specific entities (called “aspects”) contained in the analyzed texts. Detecting such aspects can be critical, especially when the domain to which the documents belong is unknown. Indeed, while in some contexts it is possible to train domain-specific models for improving the effect...

Unsupervised Domain Adaptation for Joint Segmentation and POS-Tagging

Sophisticated models have been developed for joint word segmentation and part-of-speech tagging, with increasing accuracies reported on the Chinese Treebank data. These systems, which rely on supervised learning, typically perform worse on texts from a different domain, for which little annotation is available. We consider self-training and character clustering for domain adaptation. Both metho...

Towards Unsupervised Approaches For Aspects Extraction

One of the most recent opinion mining research directions is the extraction of polarities referring to specific entities (called “aspects”) contained in the analyzed texts. Detecting such aspects can be critical, especially when the domain to which the documents belong is unknown. Indeed, while in some contexts it is possible to train domain-specific models for improving the effect...

The GOD model

GOD (General Ontology Discovery) is an unsupervised system to extract semantic relations among domain specific entities and concepts from texts. Operationally, it acts as a search engine returning a set of true predicates regarding the query instead of the usual ranked list of relevant documents. Our approach relies on two basic assumptions: (i) paradigmatic relations can be established only am...

Journal:
  • Proceedings of the National Academy of Sciences of the United States of America

Volume 113, Issue 22

Pages -

Publication date: 2016